Mapping reads¶
There are several steps involved in mapping our sequence reads and getting the output into a usable form. First we need to tell bwa to make an index of the reference genome. This is a way to set up the genome data in a way that will be computationally efficient for accessing. You’ll see this is a common theme in bioinformatics pipelines.
bwa index genome.fa
Now we are ready to do the mapping. It takes two commands. The
bwa aln
command aligns the reads and produces a file of all the
possible candidates. bwa samse
takes that file and the knowledge it
is single-end sequencing to process the alignment data to produce the
resulting SAM file where repetitive hits will be randomly chosen.
bwa aln genome.fa SRR346368_TAGCGT.fastq > SRR346368_TAGCGT_Ys_genome.sai
bwa samse genome.fa SRR346368_TAGCGT_Ys_genome.sai SRR346368_TAGCGT.fastq > SRR346368_TAGCGTvsYsGenome_bwa.sam
See manual and Biostars question on bwa alig and bwa samse or Biostars question on what is a .sai file for a full and two quick guides, respectively, to what is going on with running the two commands.
Note if you are working with paired-ends you’d use bwa sampe
to
generate your SAM file, providing the information for both reads in your
command. This would use the file created by bwa aln
and the
knowledge that it is paired-end sequencing to select the best hits
according the orientations, distance between the reads, etc.. An
unrelated example command for paired-ends is below as an EXAMPLE. NOT
FOR RUNNING TODAY.
bwa sampe dmel-all-chromosome-r5.37.fasta RAL357_1.sai RAL357_2.sai RAL357_1.fastq RAL357_2.fastq > RAL357_bwa.sam
BWA is able to take advantage of multi-threading and so you’ll get more speed if you have multiple cores on your computer which many have. For example, my iMac at home has two cores while my iMac at work has four and several of Amazon’s Elastic Cloud computing instances have 4 or more cores as summarized here. The example we chose for today has few reads to map and so it is fast without running on multiple threads. However, if you had a lot of reads it would be preferable to run the alignment on multiple threads at the same time using your machines multi-core processor. To do so you use the -t option to specify the number of threads. For example, to run in four simultaneous threads:
bwa aln -t 4 genome.fa SRR346368_TAGCGT.fastq > SRR346368_TAGCGT_Ys_genome.sai
(You can use a UNIX trick to let the computer determine how many cores
it has and thus how many threads it can run. On Linux, replace 4
on
the above line with $(nproc --all)
as illustrated
here.
On a Mac, you’d replace 4
with sysctl -n hw.ncpu
.)
Now look at the SAM output file
head -100 SRR346368_TAGCGTvsYsGenome_bwa.sam
As summarized here
The first four columns, in order, are: identifier, flag, chromosome, position, map-quality.
One of the key pieces of information is found in the second column. See here for a nice summary of how to interpret it for unpaired reads. For paired reads interpreting is more complex as described there as well. For full details of the SAM format specification see here. A nice primer can be found here.
We see a lot of ‘4’s indicating only a few mapped. Can we get some feedback easily? We can if we use the SAMtools software to look at the SAM file. Type:
samtools
You’ll see a listing that includes:
flagstat simple stats
That looks promising. As described in post #5
here it isn’t
documented in the
manual. Let’s try
flagstat
anyways.
samtools flagstat SRR346368_TAGCGTvsYsGenome_bwa.sam
We get:
EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
Plus a bunch of zero
results.
Despite the name, it seems it needs a BAM file to do this. Well, a BAM file is just a compressed binary version of a SAM file according to the list of Data File Formats at UCSC Genome Bioinformatics.
The first example under Examples on SAMtools documentation page (WAY AT THE BOTTOM) illustrates what we’d like to do.
samtools view -bS SRR346368_TAGCGTvsYsGenome_bwa.sam > SRR346368_TAGCGTvsYsGenome_bwa.bam
(According to here the -S
tag for samtools view
had in the past been used to be to specify
that the input is in the SAM format. That flag looks to be unnecessary
now according to the doccumentation for the updated
version. The way shown
above should work no matter what version of SAMtools is on your system.)
Now since it is a binary and compressed version of a SAM file, unlike the SAM file we just examined, a BAM file is not human-readable in any way. If you need convincing, just type:
head SRR346368_TAGCGTvsYsGenome_bwa.bam
However, samtools flagstat
hopefully will work with this now. Try:
samtools flagstat SRR346368_TAGCGTvsYsGenome_bwa.bam
Only 4.44% mapped?!??!
Improving the mapping of the reads¶
Let’s review the methods and see how it compares to what we did. (Fortunately, in this case they detailed parts of the computational analyses right in the supplemental information of the paper. You won’t always be this lucky.)
From the supplemental methods information in paper:
Alignment to genome, peak calling, and data sharing. The Saccharomyces reference genome was obtained from www.yeastgenome.org (build: 19-Jan-2007). The entire length of the sequenced tags were aligned to the reference genome using Corona Lite software provided by the SOLiD system, allowing up to 3 mismatches. This process was repeated for the remaining tags, after removal of the 3’ most 6 bp, which tend to have higher error rates.
As noted in the supplemental methods information they allowed mismatch
to be 3, the default for BWA listed as the first
option is 4% of read
length which is much smaller than 3. So let’s try what they did by
setting that -n
option we just examined.
bwa aln -n 3 genome.fa SRR346368_TAGCGT.fastq > SRR346368_TAGCGT_Ys_genomeN3.sai
bwa samse genome.fa SRR346368_TAGCGT_Ys_genomeN3.sai SRR346368_TAGCGT.fastq > SRR346368_TAGCGTvsYsGenome_bwaN3.sam
Let’s convert that to BAM and look at the stats.
samtools view -bS SRR346368_TAGCGTvsYsGenome_bwaN3.sam > SRR346368_TAGCGTvsYsGenome_bwaN3.bam
samtools flagstat SRR346368_TAGCGTvsYsGenome_bwaN3.bam
Indeed, improvement. Nearly 6% now.
The supplemental methods for the paper go on to describe that they took these unmapped reads and cut off the last six and then ran the mapping with those again. Obviously they are sharing some of the results if their quality assessment here and this is in part while we didn’t address earlier. We skipped running FastQC as the very first step in the interest of time and because they reported related information.) How would we do such cutting off of bases?
As you probably guessed, with more software.
First we need a way to get the unmapped reads. To get them, we’ll again rely on SAMtools.
samtools view -f 4 SRR346368_TAGCGTvsYsGenome_bwaN3.bam > SRR346368_TAGCGTvsYsGenome_bwaN3.unmapped.sam
That filters the reads that have 4
for the flag
value.
(We won’t need the @SQ header for this file so we left off the -h
option.) You can look at the output using the head
command.
The filtering created a SAM file with the unmapped, but we need to supply bwa with a fastq file for the mapping step. As described here, you have options for doing this. We are going to take advantage of a unix utility called Awk to do this, adapting the solution described here.
grep -v ^@ SRR346368_TAGCGTvsYsGenome_bwaN3.unmapped.sam | awk '{print "@"$1"\n"$10"\n+\n"$11}' > SRR346368_TAGCGTvsYsGenome_bwaN3.unmapped.fastq
The description of the approach nicely breaks down the relevant steps.
One thing it assumes is you know |
stands for using pipes to pass
the output from one command to another’s input. The Practical Computing
for Biologists book I recommended
covers this and redirects if you need guidance about pipes with a
biological bent beyond what the internet provides.
You can look at the output using the head
command and see we have
our fastq file of unmapped reads.
As for what software for trimming the reads... this is one of those processes where a lot of solutions exist. You could even build a customized one. For example, the Python program we used earlier to sort based on barcode could easily be copied and adapted to make a new program that takes every read and collects all except the last 6 bases and outputs it to a new file. One of the most popular set of tools for post-processing raw short reads for downstream approaches is the FASTX-Toolkit from the Hannon lab. There are web-based versions within Galaxy and command line versions. It is a handy set of programs to be able to run and so we’ll use a program from the collection today for trimming, fastx_trimmer, as having the tools installed will present additional post-processing abilities to you, including a more full-featured approach to demultiplexing based on barcodes.
The Fastx toolkit requires you install libgtextutils first. Once you have those installed you are ready to trim. (I have installed them for you today.)
fastx_trimmer -l 29 -i SRR346368_TAGCGTvsYsGenome_bwaN3.unmapped.fastq -o unmapped.trimmed.fastq
Now we have the unmapped reads in a trimmed form. To make the steps after easier, let’s take the untrimmed reads that mapped previously and append them to the unmapped reads in a trimmed form.
Since we saw before that filtering on the 4 flag worked, the solution to getitng the mapped reads is to instead keep all WITH THE EXCEPTION OF the unmapped.
samtools view -F 4 SRR346368_TAGCGTvsYsGenome_bwaN3.bam > bwaN3.mapped.sam
(Note that one solution you may have thought of is that you could
separately collect those that mapped to the forward and the reverse
using the -f
option to filter on 0
and 16
, respectively.
DON’T. While it will work for filter on 16
to get those mapped to
the reverse, it fails to filter out anything if you try to filter on
0
. Perhaps counterintuitively, to get those
forward mapping reads
you can see in the SAM file as having a 0
value for the flag, you need to actually use -F 20
as the option
setting. The reason the filter on
0
approach fails and the F -20
option works is because the flag
is bitwise so that the values can
be combined to express complex sets in a compact,’computer
science-wisey’ way. This tool
here is
particularly helpful for novices deciphering this. In this tool’s
flag
box, enter the flag value of 20
and hit explain
you see
that is a way of expressing those that are unmapped or map to the
reverse strand. With -F
option we exclude those. Compounding the
complex nature of this encoding is the fact it is hexidecimal based. You
may wish to seek additional help on this concept by looking at slides
28-31 of this
presentation
and information and links or
here or
herehere
or here or here as mentioned
earlier or
swbarnes’ answer
here.)
Again, we need to convert these to fastq as we did before.
grep -v ^@ bwaN3.mapped.sam | awk '{print "@"$1"\n"$10"\n+\n"$11}' > bwaN3.mapped.fastq
Now that we have a fastq file of just the mapped reads we are going to
start a combined
file and append those to the unmapped. We’ll use
the redirection symbol >>
to designate appending.
cat unmapped.trimmed.fastq > combined.fastq
cat bwaN3.mapped.fastq >> combined.fastq
Finally, we should be able to emulate the mapping approach used in the paper.
bwa aln -n 3 genome.fa combined.fastq > combined_Ys_genomeN3.sai
bwa samse genome.fa combined_Ys_genomeN3.sai combined.fastq > combined_vs_YsGenome_bwa.sam
The first line of that should take about three minutes.
And convert to a BAM file so we can use flagstat.
samtools view -bS combined_vs_YsGenome_bwa.sam > combined_vs_YsGenome_bwa.bam
So how did we do?
samtools flagstat combined_vs_YsGenome_bwa.bam
Alright, 13% mapped is a big improvement over the 6% we had. But normally you might expect better. I have communicated with Ho Sung Rhee and he says this data was from the “early days” and was concerned about a particularly high error rate. High throughput requires a different mindset than a lot of the traditional molecular biological data we collect. A lot of times you are fighting tooth and nail to get a signal, and are concerned what you want to see was lost along the way. We still have A LOT of data here even though we only around 10% of what we corralled up to this point is mapping. A lot of what Titus Brown and colleagues do does in processing of metagenome assembly data is data reduction approaches. Maybe the ChIP-exo processing steps combined with our strigent requirements of the barcode resulted in a reduction of data that benefits us ultimately? Subsequent analysis will let us evaluate the data in this case.
Preliminary view of mappings¶
So far we have just looked at the mappings in a list. SAMtools has a
program called tview
that can help us get a sense visually of what
we have done so far.
What does tview
need? Let’s check the
manual. I’d suggest
searching tview
on the page because oddly the documentation is not
internally linked and indexed.
We see need to do yet some further file formatting to get the data in a
usable form for tview
.
In order to be able to use our most recent mapping results with
flagstat
we already converted it into a BAM file. Now we need a
sorted BAM file.
samtools sort combined_vs_YsGenome_bwa.bam combined_vs_YsGenome_bwa.sorted
Okay, check your file listings.
ls
Now that we have that sorted BAM file, let’s TRY to use that to run
tview
. Recall from the
documentation we need
a fasta formatted reference sequence too.
samtools tview combined_vs_YsGenome_bwa.sorted.bam genome.fa
Okay, so this is going to cause SAMtools to give you feedback.
[bam_index_load] fail to load BAM index.
Cannot read index for 'combined_vs_YsGenome_bwa.sorted.bam'.
This is telling us we need to format yet one more type of file
concerning the data. We need to index the data. If you search on
index
in the SAMtools
documentation you’ll
see there is a command to do this and that it is needed for commands
when region arguments are used
. Going back to the documentation for
tview
you’ll see tview
can use these. Lesson: the documentation
may not always explicitly say what you need where you might think it
would be best placed and error/feedback messages will often guide you.
Let’s index our sorted bam file.
samtools index combined_vs_YsGenome_bwa.sorted.bam
You’ll see this creates a file ending
combined_vs_YsGenome_bwa.sorted.bam.bai
with .*ai
at the end
indicating SAM/BAM index file and if you look at the samtools index
documentation you’ll the purpose is to enable fast access by tview
to the coordinate-sorted data. Creation if an index file (or lookup
table, etc.) is a common computer science solution to speeding things up
by otherwise eliminating a computationally-costly process.
Finally, now that we have everything, let’s run tview
.
samtools tview combined_vs_YsGenome_bwa.sorted.bam genome.fa
You’ll be whisked away to a sequence map. This
tutorial
has a guide to navigating. Use h
and l
to move right and left;
j
and k
for up and down. (Arrow keys seem to work too.)
Type ?
for help. You may need to expand your terminal window to see
all the way to the bottom of the help
menu.
Hit q
to exit.
Let’s go somewhere specific.
samtools tview -p chrII:278586 combined_vs_YsGenome_bwa.sorted.bam genome.fa
You can also go to a location when tview
by typing g
for
goto
and entering the location. Try with chrXII:289738
. (You may
need to expand your terminal window to see all the way to the bottom.)
Peak Predictions with MACS¶
We’ll use popular program MACS for peak prediction. The current version
is 2. and Tao Liu maintains it
here. We’ll use the sub-command
callpeak
, this being the main function of MACS. See
here for a guide to the options.
macs2 callpeak -t combined_vs_YsGenome_bwa.bam --name gal4 --gsize 1.21e7 --nomodel --shift -100 --extsize 200
--gsize
is the genome size option and needs to be set or it will use
human as default.
name
is to give the output files identifying names.
Finally, --nomodel --shift -100 --extsize 200
comes from the
documentation on Github concerning the shift
option. These are suggested
for DNAse-Seq datasets and so used here since ChIP-exo shares some
similarities.
I found same result with or without the use of the tag size option,
--tsize 35
, and so I left it out.
Note that MACS originally required the ChIP data in BED format so at lest we dodged ONE conversion.
Additionally, -outdir
is a great option to use to help you keep
organized. We are purposefully not using it today only to make things
easy for navigating on these transient machines. Obviously, this isn’t
good data management practice.
Peaks relative to known genomic features with CEAS¶
We now go onto visualizing the ChIP enrichment signals relative known genomic features. For that we’ll use CEAS, another Python package by Hyunjin Shin and Tao Liu from when they were in Xiaole Shirley Liu’s Lab. (There is a web-server version but it is limited to mouse and human and seems no longer operating.)
Probably not surprisingly at this point, running CEAS requires some additional preparation. CEAS requires the peaks to be specified in the BED format, whereas the signal is to presented in a WIG file. We have the peaks in the Bed format already. We need to convert the signal from a sorted Bam file to WIG.
Additionally, the CEAS program operates on a SQLITE database that
contains the information on the annotated regions of the genome. We need
to build that database ourselves since there is not a prebuilt one for
yeast available from the author’s site. If we provide as option ‘d
the name of the genome assembly and as option g
the annotation
table, both as the are referred to at the UCSC Genome Bioinformatics
site,
a utility included in the CEAS installation package will contact the
the UCSC Genome Bioinformatics
site
and build the database for us. In addition to indicating the genome and
annotation table, the annotation building process requires a wiggle file
that describes the genomic locations that are valid for data selection.
For example, a tiling array may only cover some parts of a genome. In
our case, we theoretically should have covered all the genome and so we
simply generate a wiggle file that covers the full length of each and
every chromosome. In the interest of time, this has already been done
for you; the file is sacCer3.wig
. You can acquire it by running the
command below on the command line of your instance or get it at github
if you are running this pipeline
locally;
information about making this file is
here.
wget https://raw.githubusercontent.com/fomightez/may2015feng_gr_m/master/sacCer3.wig
To build the annotation database run
build_genomeBG -d sacCer3 -g sgdGene -w sacCer3.wig -o sc3.db
(NOTE for anyone running this not on Amazon EC2 instances: the above command needs to access an external MySQL database and I believe may be a problem on certain networks.)
We now have a custom database sc3.db
that can be used to generate
reports with CEAS.
What file formats do we need for running CEAS?
We still need a WIG file with the Chip-Seq data. Following from the
advice of Stew on this page
Bam2Wig and this
page, we can use the sorted bam
file, combined_vs_YsGenome_bwa.sorted.bam
, to make such a file.
(Yes, this file contains the same information as BAM file used for peak
predictions with MACS2, but even related bioinformatics software will
often require files in different formats.)
samtools mpileup combined_vs_YsGenome_bwa.sorted.bam | perl -ne 'BEGIN{print "track type=wiggle_0 name=combined_vs_YsGenome_bwa description=combined_vs_YsGenome_bwa\n"};($c, $start, undef, $depth) = split; if ($c ne $lastC) { print "variableStep chrom=$c\n"; };$lastC=$c;next unless $. % 10 ==0;print "$start\t$depth\n" unless $depth<3;' > combined_vs_YsGenome_bwa.wig
Finally, CEAS uses the peaks gal4_summits.bed predicted with MACS, the wiggle file combined_vs_YsGenome_bwa.wig with the ChIP-Seq signal and the custom built database sgd.db:
ceas -g sc3.db -b gal4_summits.bed -w combined_vs_YsGenome_bwa.wig
The reports are generated as the file gal4_summits.pdf
and contains
several graphs and plots.
How do you view the pdf? The issue here is that it is on a EC2 instance at Amazon and not your local computer. While it is fairly straightforward to put it on your computer (I highly recommend scp), in the interest of time I have placed it in the associated Github repository. Click here or paste the link below into your browser URL bar.
https://github.com/fomightez/may2015feng_gr_m/blob/master/gal4_summits.pdf
See the legends at the bottom of here for help in interpreting the graphs.
The first plot shows 5.9% mappable on genome background because we didn’t provide a typical control sample here and it just calculates 5.9% without such input due to the way the calculation works having a cutoff of 5.9. See in the CEAS paper
where the background is the 9th order nucleotide Markov dependency estimated from the human genomic sequence. A score cutoff of Max (5,0.9 × Motif Relative Entropy) is used to call a motif a hit.
Plus it is probably moot for the ChIP-exo data used here. Recall from Rhee and Pugh, 2011
Uncrosslinked nonspecific DNA is largely eliminated by exonuclease treatment, as evidenced by the repeated failure to generate a ChIP-exo library from a negative control BY4741 strain. Therefore, use of an exonuclease makes comparisons to input DNA or mock IP moot, in that such DNA is destroyed.
Optional: Look at reads and/or peaks in IGB or IGV or SeqMonk OR UCSC genome browser¶
In most viewers the MACS2 output can be directly loaded into according to documentation. Your mileage may vary. We will not be doing this in the interest of time today.
Comparison to published data¶
Preparation¶
Conveniently, the authors and SGD made the locations they deemed binding sites available as a BED file.
You can use your browser on your local computer to get it at
http://downloads.yeastgenome.org/published_datasets/Rhee_2011_PMID_22153082/track_files/Rhee_2011_Gal4_ChIP_exo_bound_locations_V64.bed
Alternatively if you wish to stay working on the command line...
Downloading on a Mac
curl -o Rhee_2011_Gal4_ChIP_exo_bound_locations_V64.bed "http://downloads.yeastgenome.org/published_datasets/Rhee_2011_PMID_22153082/track_files/Rhee_2011_Gal4_ChIP_exo_bound_locations_V64.bed"
Downloading on a Linux instance:
wget -O Rhee_2011_Gal4_ChIP_exo_bound_locations_V64.bed "http://downloads.yeastgenome.org/published_datasets/Rhee_2011_PMID_22153082/track_files/Rhee_2011_Gal4_ChIP_exo_bound_locations_V64.bed"
Now we could compare that to our peak summits.
Before we do that, however, let’s do a little more filtering based on standards set in the paper.
From the supplemental information Data Analysis
section for the
paper Rhee and Pugh, 2011: From Set of Gal4 bound locations
Set of Gal4 bound locations. The final set of 15 Gal4 bound locations was determined to be those locations having all of the following properties: 1) peak-pair distances between -1 and 61 bp, 2) peak-pair found in ≥2 of 3 replicates, in which a replicate peak-pair was <28 bp away (midpoint to midpoint), 3) a median of ≥75 normalized tags (see note above) per peak-pair (5% of the average tag count of all 15 peak-pairs), and 4) not at telomeric, tRNA, and rDNA regions.
We’ll discuss some of this shortly, but for now let’s focus on property four. The main issue I saw in my development of this workshop was summits in the telomeric DNA regions and so let’s at least get rid of those. Using the lengths of the chromosomes that is at the beginning of several of our SAM file, it would be fairly easy to make a Python program to pick a distance from the ends of the chromosomes and remove any summits in those. However, there is a way to filter out regions that is more general and we’ll use that next.
Filter additional regions¶
Ideally we would filter on all the regions, the authors of the study also filtered on. In the interest of time though, today we’ll just focus on one. The telomeric sites.
The Rhee and Pugh, 2011 paper used genomic features to filter out results from peak-calling. We want to do this as well.
A good, general approach seems:
- get list or lists as Bed file from YeastMine
- Uses Bedtools with
intersectBed -v
to filter out those summits that are in those regions. Our example here will be telomeres.
The steps for getting the telomere Bed
file are spelled out
here.
I’ll demonstrate the process of acquiring the file fairly fast here, and then we’ll just download the result file in the interest of time. The page at the link above spells the approaches if you are doing this yourself.
wget https://raw.githubusercontent.com/fomightez/may2015feng_gr_m/master/telomere_regions.bed
Now to filter with that file. We’ll use bedtools’s intersect
function
with the -v
option to exclude those that have any overlap with the
telomere regions.
bedtools intersect -v -a gal4_summits.bed -b telomere_regions.bed > filtered_gal4_summits.bed
Now we are ready to compare.
Compare¶
Compare your filtered_gal4_summits.bed
to
Rhee_2011_Gal4_ChIP_exo_bound_locations_V64.bed
.
I am going to write the command line commands but feel free to use whatever method you want to examine and compare the files.
head -30 filtered_gal4_summits.bed
tail -15 Rhee_2011_Gal4_ChIP_exo_bound_locations_V64.bed
Interesting. While not perfect our summits do overlap with many of theirs.
Two parts from the supplemental information Data Analysis
section
for the paper Rhee and Pugh, 2011 deserve highlighting in the light of
this:
From Alignment to genome, peak calling, and data sharing
The resulting sequence read distribution was used to identify peaks on the forward (W) and reverse (C) strand separately using the peak calling algorithm....
From Set of Gal4 bound locations
Set of Gal4 bound locations. The final set of 15 Gal4 bound locations was determined to be those locations having all of the following properties: 1) peak-pair distances between -1 and 61 bp, 2) peak-pair found in ≥2 of 3 replicates, in which a replicate peak-pair was <28 bp away (midpoint to midpoint), 3) a median of ≥75 normalized tags (see note above) per peak-pair (5% of the average tag count of all 15 peak-pairs), and 4) not at telomeric, tRNA, and rDNA regions.
Note two huge differences probably contribute largely to the differences in our results:
* Largely in the interest of time, I didn't have us do the peak calling with the forward and reverse mapped reads separately and then process those peaks further as the authors describe. Since the authors went on to use the peak-pair midpoint to locate the binding site, as described in Figure S3, I figured the peak calling algorithm would largely average this out to locate the midpoint on its own and that should be good enough for our purposes today.
* As best I can tell, we were only supplied data from 1 replicate. Their properties also mention things about 3 replicates but they only posted one replicate for Gal4 data it seems. I checked all 5 sets of data linked and only see the one replicate for Gal4 mentioned. This will obviously factor into differences with the results we get in workshop as compared to the paper.
* Additional differences could arise from the different software used in the pipeline and the more limited filtering we performed. For example, the MACS 2012 guide suggests using more specialized tools for data types like DNase-seq and ChIP-exo is similar to that.
Motif discovery with MEME¶
Preparation¶
To look for motifs, let’s try to take double the read length (35 * 2)
of flanking sequence on each side of the peaks to account for reads
going either direction from the peak summits and allowing ample space
for a reasonable size DNA binding site. We’ll use bedtools slop
from
the Bedtools collection to expand the ranges upstream and downstream to
the locations of our peaks.
bedtools slop -i filtered_gal4_summits.bed -g genome.fa.fai -b 70 > gal4_summits_more.bed
If you’d like to use the Rhee and Pugh 2011 data, the command is
bedtools slop -i Rhee_2011_Gal4_ChIP_exo_bound_locations_V64.bed -g genome.fa.fai -b 29 > rhee_more.bed
Note: the genome.fa.fai
file has summarized information about the
lengths of each chromosome that bedtools slop
can use to limit the
ranges added to intervals to not extend beyond the actual length of the
chromosome sequences. (The support for .fai files for bedtools slop
was only noted among the comments at the bottom of the page
here.)
(The 29 base selection for the Rhee and Pugh Bed file was simply an arbitrary value based around the midpoint of the peak-pair distances identified in the paper. Obviously if we were doing this for actual discovery we may have wanted to try a few different values here, or better yet given the method, worked out for ourselves the peak-pair values for our data.)
The Bed
file just listed start and end locations for the peaks and
we simply expanded this window with bedtools slop
. We actually need
the sequences specified by our expanded ranges in order to provide the
MEME analysis tool with the sequences as input. We’ll use another tool
from the Bedtools collection to accomplish this.
I am going to illustrate the commands for both our data and the published data here for the next few commands. You can see how our data comes out today using the MEME tool and then check back and see how it would come out with there data if you’d like.
bedtools getfasta -fi genome.fa -bed gal4_summits_more.bed -fo gal4_summits_more.fa
If you’d like to use the Rhee and Pugh 2011 data, the command is
bedtools getfasta -fi genome.fa -bed rhee_more.bed -fo rhee_more.fa
Expand your window in which you are working as tall as you can and then enter on command line
less gal4_summits_more.fa
If you’d like to use the Rhee and Pugh 2011 data, the command is
less rhee_more.fa
Now highlight in your terminal and copy the highlighted text to the clipboard.
Open your text editor.
Paste the text you copied into a text editor program and save the file
as gal4_summits_more.fa
or rhee_more.fa
, respectively.
Now using the browser on your computer go to
here . Use the above file to
upload it where it asks for Input the primary sequences
. (Supposedly
here is the most up-to-date
version of the site. However, the Upstate network said it was
unavailable or it violated policy and has been blocked when I submitted
jobs there.)
Under Select the number of motifs
, adjust the number of motifs to
find to one.
Provide an e-mail address.
Click on ‘Start Search’ to submit your task. It will be a short wait. Your results will be available in your browser soon. No need to check email most likely.
Click on MEME html output
and view the results by hitting blue arrow
below the More
column on the next screen.
How do your results match up with the figure 3 MEME output logo from Rhee and Pugh, 2011?
How do these results match up with the structural basis of DNA binding by Gal4p?
The slides will be used to discuss the above two questions in more depth.
Why not PRECISELY same (even when using their file)? We have to imagine we were coming at this naively and trying to identify a motif. (Or least be performing a proof of concept experiment as they were here.) This was obviously just an initial first pass to see if any motif is identified. One is. This would obvious warrant further examination. In this case, we’d also have to look and see that because of the amount of flanking sequence added to each site region, some of the sites identified in the ChIP-exo data overlap with neighboring sites in the sequences submitted to MEME or are left out because the peak identification in MACS2 we used seems to squelch out neighboring sites. Thus we have probably biased this rough first pass to the best ones when faced with overlap. Today was just meant to give you a flavor for the steps involved if you were trying to use this to search for a novel motif. In a thorough investigation of the motif there’d be subsequent steps and the results seen on the left side of Figure 3 of Rhee and Pugh, 2011 are likely the result of such further analysis. The authors may have restricted the sites they submitted down to say the best match within the distances they had identified with the peak pairs. Additionally, judging by the indication to the orientation of the gene, I think they may have used the advanced option settings in MEME to limit the search to given strand and provided the upstream sequence from the gene which is not what we did here. Additionally, as touched in numerous places, their approach to identify peak pairs and sites was more detailed than what we did here, and had additional replicates and filtering based on features (tRNA and rDNA) to confirm or rule out peaks, and so the upstream results before MEME were obviously different.
(Actually, I have used the data available in the Rhee and Pugh Gal4 Bed file available from SGD to eliminate the overlaps and found I still don’t get an exact match to the Figure 3 MEME output logo. It is very, very close. However, there is a further complication. Figure 3 shows the sequence of 15 binding sites. If you count the lines of the actual data in the Gal4 Bed file, you’ll see there is only 14. Likewise, the Gal4 data in the supplemental Excel file with the publication has a heading of 15 Gal4 binding sites, but there are only 14 there. One of the Gal7 sites shown in Figure 3 is absent in the Excel spreadsheet, and hence the Bed file because the Bed file header comments says the SGD curators used supplemental data to make the Bed file. Even adjusting for that and the orientation, plus setting the motif to have to be exactly 19 bps long, I didn’t obtain exactly their motif logo, and so maybe due to a difference in other setting(s) or different software version.)